DISTRIBUTED APPROACH TO WEB PAGE CATEGORIZATION USING MAP-REDUCE PROGRAMMING MODEL
Abstract
The web is a large repository of information, and categorizing web documents is essential to searching and retrieving pages from it. Automatic classification of web pages is an effective way to handle the complexity of information retrieval from the internet. Although many automatic classification algorithms and systems have been presented, most existing approaches are computationally demanding. To overcome this challenge, we propose a parallel algorithm, based on the MapReduce programming model, that automatically categorizes web pages. The approach combines three components: a web crawler, the MapReduce programming model, and the proposed web page categorization approach. First, a web crawler mines the World Wide Web, and the crawled pages are given directly as input to the MapReduce programming model. The MapReduce model, adapted to the proposed categorization approach, then finds the appropriate category for each web page according to its content. Experimental results show that the proposed parallel approach achieves satisfactory results in finding the right category for a given web page.
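The pipeline described in the abstract (crawled pages in, one category per page out) can be illustrated with a minimal, single-process sketch of the map, shuffle, and reduce phases. The abstract does not specify the scoring method, so this sketch assumes a simple keyword-hit count; the keyword lists and the names `CATEGORY_KEYWORDS`, `map_page`, `shuffle`, `reduce_hits`, and `categorize` are all hypothetical and are not taken from the paper.

```python
import re
from collections import defaultdict

# Hypothetical keyword lists, chosen only for illustration (not from the paper).
CATEGORY_KEYWORDS = {
    "sports": {"football", "match", "team", "score"},
    "technology": {"software", "computer", "network", "data"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def map_page(url, text):
    """Map step: emit ((url, category), 1) for each keyword hit on the page."""
    for token in tokenize(text):
        for category, keywords in CATEGORY_KEYWORDS.items():
            if token in keywords:
                yield (url, category), 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_hits(key, values):
    """Reduce step: sum the keyword hits for one (url, category) key."""
    return key, sum(values)


def categorize(pages):
    """Run map, shuffle, and reduce in-process, then pick the top category per page."""
    mapped = (pair for url, text in pages.items() for pair in map_page(url, text))
    scores = defaultdict(dict)
    for key, values in shuffle(mapped).items():
        (url, category), total = reduce_hits(key, values)
        scores[url][category] = total
    # Pages with no keyword hits are labeled "unknown" (an assumption of this sketch).
    return {
        url: max(scores[url], key=scores[url].get) if url in scores else "unknown"
        for url in pages
    }
```

Calling `categorize` with a dictionary that maps page URLs to their crawled text returns a dictionary that maps each URL to its highest-scoring category. In a real deployment the map and reduce functions would run on separate workers under a MapReduce framework; here they run sequentially only to show the data flow.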
Similar resources
MapReduce K-Means based Co-Clustering Approach for Web Page Recommendation System
Co-clustering is one of the data mining techniques used for web usage mining. Co-clustering Web log data is the process of simultaneously categorizing both users and pages. It is used to extract users' information based on a subset of pages. Nowadays, cyberspace is filled with a huge volume of data distributed across the world. The business knowledge acquaintance from such a voluminous d...
Map Reduce Text Clustering Using Vector Space Model
Information retrieval is the area of finding particular web pages via a query to an internet search engine. Even though sophisticated algorithms and data structures are used in traditional computing techniques to create indexes for efficiently organizing and retrieving information, data mining techniques like clustering are currently used to enhance the efficiency of the retrieval process. ...
Learning Structural Classification Rules for Web-Page Categorization
Content-related metadata plays an important role in the effort of developing intelligent web applications. One of the most established forms of providing content-related metadata is the assignment of web pages to content categories. We describe the Spectacle system for classifying individual web pages on the basis of their syntactic structure. This classification requires the specification of cla...
Multilabel Classification of Documents with Mapreduce
Multilabel classification is the problem of assigning a set of positive labels to an instance. It is increasingly required in applications such as protein function classification, music categorization, gene classification, and document classification for easy identification and retrieval of information. Labeling the documents of the web manually is a time-consuming and difficult task due t...
Using neighborhood information for automated categorization of Web pages
In this paper we discuss several issues related to the influence of expanding a Web document representation on the quality of topical categorization of Web pages. We consider expanding a Web page by using the text content of its linking pages. We show that naive expansion can pick up too much noise and substantially harm categorization results. We present an approach to automated pruning of linking Web...